Abstract
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, ``needle-in-a-haystack'' details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
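The binary claim evaluation protocol described above can be illustrated with a minimal scoring sketch. It assumes each model output has already been reduced to a True/False verdict per claim; the function name `pairwise_accuracy`, the tuple layout, and the example verdicts are hypothetical and are not taken from the benchmark's released code.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(predictions: Iterable[Tuple[bool, bool]]) -> float:
    """Score claim pairs under a both-or-nothing rule.

    Each item is (verdict_on_fact, verdict_on_fib): the model's True/False
    judgement of the true claim and of the false claim in one pair.
    """
    preds = list(predictions)
    if not preds:
        raise ValueError("no claim pairs to score")
    # A pair counts only if the fact is accepted AND the fib is rejected.
    correct = sum(1 for fact_ok, fib_ok in preds if fact_ok and not fib_ok)
    return correct / len(preds)


# Hypothetical verdicts for four claim pairs (illustrative only).
example = [(True, False), (True, True), (False, False), (True, False)]
score = pairwise_accuracy(example)  # two of the four pairs are fully correct
```

Under this rule, a model that labels every claim as true gets every fact right but every fib wrong, so it scores 0 on pairs even though its per-claim accuracy would be 50%.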